# Initialize Notebook
%run library/init.ipy
HTML('''<script> code_show=true; function code_toggle() { if (code_show){ $('div.input').hide(); } else { $('div.input').show(); } code_show = !code_show } $( document ).ready(code_toggle); </script> <form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')
This notebook contains an analysis of GEO dataset GSE76023 (https://www.ncbi.nlm.nih.gov/gds/?term=GSE76023) created using the BioJupies Generator.
The notebook is divided into the following sections:
Here, the GEO dataset GSE76023 is loaded into the notebook. Expression data was quantified as gene-level counts using the ARCHS4 pipeline (Lachmann et al., 2017), available at http://amp.pharm.mssm.edu/archs4/.
# Load dataset
dataset = load_dataset(source='archs4', gse='GSE76023', platform='GPL11154')
# Preview expression data
dataset['rawdata'].head()
Gene | GSM1972961 | GSM1972956 | GSM1972957 | GSM1972954 | GSM1972955 | GSM1972958 | GSM1972959 | GSM1972960 | GSM1972963 | GSM1972962 | GSM1972967 | GSM1972966 | GSM1972965 | GSM1972964 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A1BG | 57 | 625 | 564 | 603 | 761 | 641 | 542 | 31 | 150 | 75 | 153 | 111 | 112 | 121 |
A1CF | 1 | 48 | 36 | 45 | 49 | 73 | 15 | 1 | 10 | 4 | 18 | 18 | 21 | 20 |
A2M | 5 | 122 | 78 | 159 | 167 | 917 | 510 | 3 | 130 | 116 | 37 | 49 | 16 | 12 |
A2ML1 | 223 | 952 | 355 | 846 | 1083 | 346 | 298 | 256 | 568 | 449 | 409 | 252 | 292 | 246 |
A2MP1 | 3 | 31 | 23 | 38 | 38 | 73 | 16 | 2 | 11 | 9 | 15 | 18 | 13 | 20 |
Table 1 | RNA-seq expression data. The table displays the first 5 rows of the quantified RNA-seq expression dataset. Rows represent genes, columns represent samples, and values show the number of mapped reads.
# Display metadata
display_metadata(dataset)
Sample_geo_accession | Sample Title | cell line | treatment |
---|---|---|---|
GSM1972961 | hESC_siControl+RA_replicate2 | H9 | Control treated with Retinoic Acid (RA) for 4 ... |
GSM1972956 | hESC_shControl_replicate3 | H9 | Control |
GSM1972957 | hESC_shLncRNA_replicate1 | H9 | lncRNA Knock Down |
GSM1972954 | hESC_shControl_replicate1 | H9 | Control |
GSM1972955 | hESC_shControl_replicate2 | H9 | Control |
GSM1972958 | hESC_shLncRNA_replicate2 | H9 | lncRNA Knock Down |
GSM1972959 | hESC_shLncRNA_replicate3 | H9 | lncRNA Knock Down |
GSM1972960 | hESC_siControl+RA_replicate1 | H9 | Control treated with Retinoic Acid (RA) for 4 ... |
GSM1972963 | hESC_siControl_replicate2 | H9 | Control |
GSM1972962 | hESC_siControl_replicate1 | H9 | Control |
GSM1972967 | hESC_siTP53_replicate2 | H9 | TP53 Knock Down |
GSM1972966 | hESC_siTP53_replicate1 | H9 | TP53 Knock Down |
GSM1972965 | hESC_siTP53+RA_replicate2 | H9 | TP53 Knock Down treated with Retinoic Acid (RA... |
GSM1972964 | hESC_siTP53+RA_replicate1 | H9 | TP53 Knock Down treated with Retinoic Acid (RA... |
Table 2 | Sample metadata. The table displays the metadata associated with the samples in the RNA-seq dataset. Rows represent RNA-seq samples, columns represent metadata categories.
# Configure signatures
dataset['signature_metadata'] = {
    'siControl vs siRNA': {
        'A': ['GSM1972954', 'GSM1972955', 'GSM1972956', 'GSM1972960', 'GSM1972961', 'GSM1972962', 'GSM1972963'],
        'B': ['GSM1972957', 'GSM1972958', 'GSM1972959', 'GSM1972964', 'GSM1972965', 'GSM1972966', 'GSM1972967']
    }
}
# Generate signatures
for label, groups in dataset['signature_metadata'].items():
    signatures[label] = generate_signature(group_A=groups['A'], group_B=groups['B'], method='limma', dataset=dataset)
Principal Component Analysis (PCA) is a statistical technique used to identify global patterns in high-dimensional datasets. It is commonly used to explore the similarity of biological samples in RNA-seq datasets. To achieve this, gene expression values are transformed into Principal Components (PCs), a set of linearly uncorrelated features which represent the most relevant sources of variance in the data, and subsequently visualized using a scatter plot.
# Run analysis
results['pca'] = analyze(dataset=dataset, tool='pca', nr_genes=2500, normalization='logCPM', z_score='True')
# Display results
plot(results['pca'])
**Figure 1 | Principal Component Analysis results.** The figure displays an interactive, three-dimensional scatter plot of the first three Principal Components (PCs) of the data. Each point represents an RNA-seq sample. Samples with similar gene expression profiles are closer in the three-dimensional space. If provided, sample groups are indicated using different colors, allowing for easier interpretation of the results.
Clustergrammer is a web-based tool for visualizing and analyzing high-dimensional data as interactive and hierarchically clustered heatmaps. It is commonly used to explore the similarity between samples in an RNA-seq dataset. In addition to identifying clusters of samples, it also allows users to identify the genes which contribute to the clustering.
# Run analysis
results['clustergrammer'] = analyze(dataset=dataset, tool='clustergrammer', nr_genes=2500, normalization='logCPM', z_score='True')
# Display results
plot(results['clustergrammer'])
**Figure 2 | Clustergrammer analysis.** The figure contains an interactive heatmap displaying gene expression for each sample in the RNA-seq dataset. Every row of the heatmap represents a gene, every column represents a sample, and every cell displays normalized gene expression values. The heatmap additionally features color bars beside each column which represent prior knowledge of each sample, such as the tissue of origin or experimental treatment.
To quantify gene expression in an RNA-seq dataset, reads generated from the sequencing step are mapped to a reference genome and subsequently aggregated into numeric gene counts. Due to experimental variations and random technical noise, samples in an RNA-seq dataset often have variable amounts of total RNA. Library size analysis calculates and displays the total number of reads mapped for each sample in the RNA-seq dataset, facilitating the identification of outlying samples and the assessment of the overall quality of the data.
# Run analysis
results['library_size_analysis'] = analyze(dataset=dataset, tool='library_size_analysis')
# Display results
plot(results['library_size_analysis'])
**Figure 3 | Library Size Analysis results.** The figure contains an interactive bar chart which displays the total number of reads mapped to each RNA-seq sample in the dataset. Additional information for each sample is available by hovering over the bars. If provided, sample groups are indicated using different colors, thus allowing for easier interpretation of the results.
Gene expression signatures are alterations in the patterns of gene expression that occur as a result of cellular perturbations such as drug treatments, gene knock-downs or diseases. They can be quantified using differential gene expression (DGE) methods, which compare gene expression between two groups of samples to identify genes whose expression is significantly altered in the perturbation. The signature table is used to interactively display the results of such analyses.
# Initialize results
results['signature_table'] = {}
# Loop through signatures
for label, signature in signatures.items():
    # Run analysis
    results['signature_table'][label] = analyze(signature=signature, tool='signature_table', signature_label=label)
    # Display results
    plot(results['signature_table'][label])
Gene | logFC | AveExpr | P-value | FDR |
---|---|---|---|---|
*BNIP3 | -1.38 | 5.93 | 4.589204e-07 | 0.016171 |
*EGLN1 | -1.39 | 4.38 | 2.229788e-06 | 0.039287 |
TEK | -1.45 | 4.14 | 6.830855e-06 | 0.070206 |
SYT11 | 1.60 | 3.69 | 7.969307e-06 | 0.070206 |
SMS | -1.20 | 8.04 | 1.547031e-05 | 0.105668 |
LRRIQ4 | -2.85 | -4.24 | 1.799223e-05 | 0.105668 |
ZNF732 | -3.71 | 0.82 | 3.104663e-05 | 0.135925 |
CLDN1 | -1.65 | 1.97 | 3.131367e-05 | 0.135925 |
ZNF680 | -4.17 | 2.08 | 3.641696e-05 | 0.135925 |
A2ML1 | -0.98 | 3.21 | 3.857353e-05 | 0.135925 |
CHMP1B2P | 2.19 | 1.33 | 5.005413e-05 | 0.153510 |
DDIT4 | -1.36 | 5.90 | 5.227664e-05 | 0.153510 |
MYOCD | -2.87 | 1.13 | 9.285398e-05 | 0.248501 |
ARHGAP40 | -1.70 | -0.86 | 9.872922e-05 | 0.248501 |
ANKRD37 | -1.78 | 0.94 | 1.336163e-04 | 0.294369 |
NPPB | -4.01 | -1.82 | 1.336598e-04 | 0.294369 |
LDHAL6FP | 2.53 | -4.06 | 1.498579e-04 | 0.310629 |
BEX5 | -2.28 | -0.48 | 1.701411e-04 | 0.311160 |
ZNF667 | 1.99 | 2.36 | 1.783920e-04 | 0.311160 |
SPATA31A5 | 3.96 | -3.86 | 1.805070e-04 | 0.311160 |
FOXD1 | -1.91 | -0.72 | 1.900143e-04 | 0.311160 |
PGK1 | -0.79 | 9.13 | 1.942655e-04 | 0.311160 |
S100A10 | -1.26 | 5.18 | 2.132083e-04 | 0.312459 |
ALOXE3 | -2.30 | -2.34 | 2.203366e-04 | 0.312459 |
TCEAL5 | 1.52 | -0.76 | 2.270673e-04 | 0.312459 |
ALX1 | -1.72 | 0.42 | 2.391184e-04 | 0.312459 |
RP11-164C12.1 | 1.80 | -5.18 | 2.400245e-04 | 0.312459 |
C17ORF51 | -3.41 | 0.75 | 2.495581e-04 | 0.312459 |
RGS5 | -1.87 | 4.40 | 2.571458e-04 | 0.312459 |
SLC27A6 | -1.00 | 2.50 | 2.757710e-04 | 0.323921 |
ADCYAP1 | 3.01 | -3.96 | 2.972895e-04 | 0.337932 |
CCDC105 | -1.60 | -5.73 | 3.166825e-04 | 0.348727 |
LDHA | -1.33 | 8.84 | 3.443217e-04 | 0.367673 |
C16ORF54 | 1.32 | 2.49 | 3.677980e-04 | 0.377844 |
FGFR1 | 0.55 | 9.39 | 3.752922e-04 | 0.377844 |
PRKX | -0.86 | 6.85 | 4.548420e-04 | 0.443738 |
RP11-478C6.5 | -3.28 | -3.45 | 4.732157e-04 | 0.443738 |
LAMB3 | -1.83 | -1.01 | 4.856862e-04 | 0.443738 |
RP11-452N17.1 | -4.70 | -3.90 | 4.928919e-04 | 0.443738 |
TES | -0.84 | 5.00 | 5.379305e-04 | 0.443738 |
FFAR3 | -1.30 | -5.85 | 5.548237e-04 | 0.443738 |
RP4-614C10.1 | -1.45 | -5.77 | 5.588798e-04 | 0.443738 |
OR9A3P | 1.98 | -4.50 | 5.596973e-04 | 0.443738 |
C11ORF45 | -1.59 | -0.68 | 5.759625e-04 | 0.443738 |
FBXW12 | 1.13 | 1.19 | 6.098803e-04 | 0.443738 |
FAM72C | -0.99 | 3.67 | 6.146733e-04 | 0.443738 |
SFXN3 | -1.04 | 1.54 | 6.276078e-04 | 0.443738 |
LYPD6B | -1.19 | 0.90 | 6.465206e-04 | 0.443738 |
SSXP10 | 3.17 | -3.24 | 6.702067e-04 | 0.443738 |
EPS8L2 | -0.78 | 4.40 | 6.837556e-04 | 0.443738 |
THBS1 | -1.55 | 6.44 | 6.848606e-04 | 0.443738 |
OTOP2 | -2.69 | -4.78 | 6.865404e-04 | 0.443738 |
KRT17 | -2.97 | -2.28 | 6.918509e-04 | 0.443738 |
RP11-195E2.1 | 2.86 | -3.57 | 6.961920e-04 | 0.443738 |
TAGLN2 | -0.91 | 5.86 | 7.062242e-04 | 0.443738 |
CAT | -2.82 | 1.80 | 7.138099e-04 | 0.443738 |
RPL12P30 | 1.85 | -4.87 | 7.587010e-04 | 0.443738 |
COL11A2 | -1.09 | 2.19 | 7.636107e-04 | 0.443738 |
GABBR2 | -1.34 | 1.98 | 7.822274e-04 | 0.443738 |
OR14A16 | -1.13 | -5.96 | 8.129828e-04 | 0.443738 |
IL32 | -1.54 | 1.14 | 8.382889e-04 | 0.443738 |
REPS2 | -0.83 | 3.58 | 8.426418e-04 | 0.443738 |
C4ORF47 | -1.64 | 0.47 | 8.445356e-04 | 0.443738 |
CH507-152C13.3 | 2.14 | -4.15 | 8.450843e-04 | 0.443738 |
FGF11 | -0.65 | 5.09 | 8.794874e-04 | 0.443738 |
PALLD | -1.08 | 5.37 | 8.799509e-04 | 0.443738 |
KRT9 | -2.11 | -5.34 | 9.053891e-04 | 0.443738 |
LYPD6 | -0.74 | 3.54 | 9.079779e-04 | 0.443738 |
H1F0 | -0.80 | 7.54 | 9.183412e-04 | 0.443738 |
FAAH | 1.10 | 3.63 | 9.301756e-04 | 0.443738 |
SLC17A8 | -2.31 | -3.70 | 9.309911e-04 | 0.443738 |
PABPC4L | -1.47 | 1.87 | 9.544448e-04 | 0.443738 |
HK2 | -0.67 | 6.75 | 9.691204e-04 | 0.443738 |
CER1 | -2.09 | -0.13 | 9.892299e-04 | 0.443738 |
SEC14L4 | 1.28 | -0.38 | 9.957197e-04 | 0.443738 |
MST1P2 | -1.11 | -0.36 | 1.000185e-03 | 0.443738 |
AFAP1L2 | -1.57 | 2.61 | 1.006632e-03 | 0.443738 |
RP11-475I24.9 | 2.07 | -3.16 | 1.015847e-03 | 0.443738 |
MYL7 | -2.29 | -0.22 | 1.032060e-03 | 0.443738 |
FLVCR1 | 0.67 | 5.97 | 1.039717e-03 | 0.443738 |
P4HA1 | -0.98 | 5.79 | 1.046038e-03 | 0.443738 |
DNM1 | 0.72 | 5.40 | 1.066345e-03 | 0.443738 |
RP11-366M4.11 | -1.97 | 0.98 | 1.074233e-03 | 0.443738 |
AC012512.1 | -1.75 | -0.04 | 1.074879e-03 | 0.443738 |
IQCA1 | -0.96 | 3.75 | 1.077756e-03 | 0.443738 |
FBLN5 | -1.04 | 0.94 | 1.087867e-03 | 0.443738 |
BLCAP | 0.52 | 5.37 | 1.095555e-03 | 0.443738 |
RP11-495P10.9 | -2.61 | -4.20 | 1.116928e-03 | 0.445183 |
GPX5 | -2.04 | -5.22 | 1.124392e-03 | 0.445183 |
ZNF785 | 0.64 | 3.71 | 1.144775e-03 | 0.448218 |
EN2 | 3.65 | -3.42 | 1.217064e-03 | 0.467745 |
DUOXA1 | -1.55 | -1.69 | 1.230742e-03 | 0.467745 |
LRRC4 | 1.38 | 3.48 | 1.234752e-03 | 0.467745 |
CPNE4 | -0.88 | 1.10 | 1.271898e-03 | 0.467745 |
STARD13 | -0.73 | 2.25 | 1.280567e-03 | 0.467745 |
C9ORF64 | -1.75 | 2.81 | 1.285778e-03 | 0.467745 |
RP5-1100H13.3 | -1.53 | -5.81 | 1.287568e-03 | 0.467745 |
RP5-877J2.1 | -1.60 | -5.67 | 1.305208e-03 | 0.469315 |
CHMP4C | -1.32 | 1.44 | 1.348082e-03 | 0.478209 |
SERPINB6 | -0.68 | 4.53 | 1.357085e-03 | 0.478209 |
**Table 3 | Differential Expression Table.** The table displays the gene expression signature generated from a differential gene expression analysis. Every row of the table represents a gene; the columns display the estimated measures of differential expression. Links to external resources containing additional information for each gene are also provided.
Volcano plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the significance versus fold-change estimated by the differential expression analysis.
# Initialize results
results['volcano_plot'] = {}
# Loop through signatures
for label, signature in signatures.items():
    # Run analysis
    results['volcano_plot'][label] = analyze(signature=signature, tool='volcano_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5)
    # Display results
    plot(results['volcano_plot'][label])
**Figure 4 | Volcano Plot.** The figure contains an interactive scatter plot which displays the log2-fold changes and statistical significance of each gene calculated by performing a differential gene expression analysis. Every point in the plot represents a gene. Red points indicate significantly up-regulated genes, blue points indicate down-regulated genes. Additional information for each gene is available by hovering over it.
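As a rough illustration of how the two thresholds passed to the volcano plot (`pvalue_threshold=0.05`, `logfc_threshold=1.5`) could translate into the red and blue highlighting, the sketch below labels a gene from its log2-fold change and adjusted P-value. The function name `classify_gene`, the use of the FDR column rather than the raw P-value, and the choice of strict versus non-strict comparisons are all assumptions; the notebook does not state how the plotting tool applies the cut-offs.

```python
def classify_gene(log2_fc, adj_pvalue, pvalue_threshold=0.05, logfc_threshold=1.5):
    """Label a gene 'up', 'down', or 'ns' (not significant).

    Assumed rule: a gene is highlighted only if it passes BOTH the
    adjusted P-value cut-off and the absolute log2-fold change cut-off.
    """
    if adj_pvalue < pvalue_threshold:
        if log2_fc >= logfc_threshold:
            return "up"
        if log2_fc <= -logfc_threshold:
            return "down"
    return "ns"


# (logFC, FDR) pairs copied from Table 3
table3_rows = {
    "BNIP3": (-1.38, 0.016171),
    "EGLN1": (-1.39, 0.039287),
    "SYT11": (1.60, 0.070206),
}
labels = {gene: classify_gene(lfc, fdr) for gene, (lfc, fdr) in table3_rows.items()}
```

Under these assumed rules, BNIP3 and EGLN1 pass the FDR cut-off but not the fold-change cut-off, while SYT11 passes the fold-change cut-off but not the FDR cut-off, so none of the three would be highlighted.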
MA plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the average gene expression versus fold-change estimated by the differential expression analysis.
# Initialize results
results['ma_plot'] = {}
# Loop through signatures
for label, signature in signatures.items():
    # Run analysis
    results['ma_plot'][label] = analyze(signature=signature, tool='ma_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5)
    # Display results
    plot(results['ma_plot'][label])
**Figure 5 | MA Plot.** The figure contains an interactive scatter plot which displays the average expression and log2-fold change of each gene calculated by performing differential gene expression analysis. Every point in the plot represents a gene. Red points indicate significantly up-regulated genes, blue points indicate down-regulated genes. Additional information for each gene is available by hovering over it.
Enrichment analysis is a statistical procedure used to identify biological terms which are over-represented in a given gene set. These include signaling pathways, molecular functions, diseases, and a wide variety of other biological terms obtained by integrating prior knowledge of gene function from multiple resources. Enrichr is a web-based application which allows users to perform enrichment analysis using a large collection of gene-set libraries and various interactive approaches to display enrichment results.
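To make the idea of over-representation concrete, the following sketch computes a one-sided hypergeometric P-value for the overlap between an input gene set and the genes annotated to a single term. This is a generic formulation of the test, not a reproduction of Enrichr's implementation; the function name and the toy numbers are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(overlap, geneset_size, term_size, background_size):
    """Probability of seeing at least `overlap` shared genes by chance.

    Models drawing `geneset_size` genes without replacement from a background
    of `background_size` genes, of which `term_size` are annotated to the term.
    """
    max_overlap = min(geneset_size, term_size)
    total = comb(background_size, geneset_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, geneset_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 annotated to the term,
# 5 genes in the input set, 3 of which are annotated to the term.
p_value = overrepresentation_pvalue(overlap=3, geneset_size=5, term_size=5, background_size=20)
```

A small P-value indicates that the observed overlap is larger than expected for a random gene set of the same size. In practice, the P-values for all terms in a library are then corrected for multiple testing.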
# Initialize results
results['enrichr'] = {}
# Loop through signatures
for label, signature in signatures.items():
    # Run analysis
    results['enrichr'][label] = analyze(signature=signature, tool='enrichr', signature_label=label, geneset_size=500)
    # Display results
    plot(results['enrichr'][label])
**Table 4 | Enrichr links.** The table displays links to Enrichr containing the results of enrichment analyses generated by analyzing the up-regulated and down-regulated genes from a differential expression analysis. By clicking on these links, users can interactively explore and download the enrichment results from the Enrichr website.
Gene Ontology (GO) is a major bioinformatics initiative aimed at unifying the representation of gene attributes across all species. It contains a large collection of experimentally validated and predicted associations between genes and biological terms. This information can be leveraged by Enrichr to identify the biological processes, molecular functions and cellular components which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.
# Initialize results
results['go_enrichment'] = {}
# Loop through results
for label, enrichr_results in results['enrichr'].items():
    # Run analysis
    results['go_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='go_enrichment', signature_label=label)
    # Display results
    plot(results['go_enrichment'][label])
**Figure 6 | Gene Ontology Enrichment Analysis Results.** The figure contains interactive bar charts displaying the results of the Gene Ontology enrichment analysis generated using Enrichr. The x axis indicates the enrichment score for each term. Significant terms are highlighted in bold. Additional information about enrichment results is available by hovering over each bar.
Biological pathways are sequences of interactions between biochemical compounds which play a key role in determining cellular behavior. Databases such as KEGG, Reactome and WikiPathways contain a large number of associations between such pathways and genes. This information can be leveraged by Enrichr to identify the biological pathways which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.
# Initialize results
results['pathway_enrichment'] = {}
# Loop through results
for label, enrichr_results in results['enrichr'].items():
    # Run analysis
    results['pathway_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='pathway_enrichment', signature_label=label)
    # Display results
    plot(results['pathway_enrichment'][label])
**Figure 7 | Pathway Enrichment Analysis Results.** The figure contains interactive bar charts displaying the results of the pathway enrichment analysis generated using Enrichr. The x axis indicates the enrichment score for each term. Significant terms are highlighted in bold. Additional information about enrichment results is available by hovering over each bar.
Transcription Factors (TFs) are proteins involved in the transcriptional regulation of gene expression. Databases such as ChEA and ENCODE contain a large number of associations between TFs and their transcriptional targets. This information can be leveraged by Enrichr to identify the transcription factors whose targets are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.
# Initialize results
results['tf_enrichment'] = {}
# Loop through results
for label, enrichr_results in results['enrichr'].items():
    # Run analysis
    results['tf_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='tf_enrichment', signature_label=label)
    # Display results
    plot(results['tf_enrichment'][label])
Rank | Transcription Factor | P-value | FDR | Target |
---|---|---|---|---|
1 | SUZ12 | 0.000488 | 0.305163 | 47 downregulated targets |
2 | BMI1 | 0.001177 | 0.367937 | 43 downregulated targets |
3 | EZH2 | 0.002960 | 0.616589 | 36 downregulated targets |
4 | CBX2 | 0.020278 | 1.000000 | 31 upregulated targets |
5 | JARID2 | 0.022361 | 1.000000 | 39 upregulated targets |
6 | EED | 0.044348 | 1.000000 | 29 upregulated targets |
7 | TP53 | 0.052473 | 1.000000 | 37 upregulated targets |
8 | PHC1 | 0.058499 | 1.000000 | 31 upregulated targets |
9 | RNF2 | 0.106732 | 1.000000 | 33 upregulated targets |
10 | POU5F1 | 0.215899 | 1.000000 | 19 upregulated targets |
11 | ERG | 0.283907 | 1.000000 | 10 upregulated targets |
12 | TP63 | 0.291228 | 1.000000 | 6 upregulated targets |
13 | CIITA | 0.436027 | 1.000000 | 2 upregulated targets |
14 | TAL1 | 0.452968 | 1.000000 | 2 upregulated targets |
15 | TCF21 | 0.608439 | 1.000000 | 1 upregulated targets |
16 | HOXA2 | 0.618246 | 1.000000 | 1 upregulated targets |
17 | TRIM28 | 0.661864 | 1.000000 | 2 upregulated targets |
18 | IKZF1 | 0.747824 | 1.000000 | 3 upregulated targets |
19 | RING1B | 0.748248 | 1.000000 | 46 upregulated targets |
20 | CTNNB1 | 0.774014 | 1.000000 | 3 upregulated targets |
21 | SOX9 | 0.802680 | 1.000000 | 1 upregulated targets |
22 | IRF8 | 0.812134 | 1.000000 | 6 upregulated targets |
23 | PRDM16 | 0.812825 | 1.000000 | 2 upregulated targets |
24 | MYCN | 0.839949 | 1.000000 | 4 upregulated targets |
25 | KLF4 | 0.851418 | 1.000000 | 32 upregulated targets |
26 | TAF15 | 0.868596 | 1.000000 | 4 upregulated targets |
27 | WT1 | 0.877679 | 1.000000 | 3 upregulated targets |
28 | BCAT | 0.891861 | 1.000000 | 12 upregulated targets |
29 | GATA2 | 0.892782 | 1.000000 | 1 upregulated targets |
30 | ZFP322A | 0.898099 | 1.000000 | 1 upregulated targets |
31 | ESR1 | 0.916862 | 1.000000 | 1 upregulated targets |
32 | VDR | 0.918984 | 1.000000 | 8 upregulated targets |
33 | KLF2 | 0.926795 | 1.000000 | 1 upregulated targets |
34 | KLF5 | 0.926795 | 1.000000 | 1 upregulated targets |
35 | SALL1 | 0.930428 | 1.000000 | 1 upregulated targets |
36 | IRF1 | 0.935010 | 1.000000 | 5 upregulated targets |
37 | CEBPD | 0.938331 | 1.000000 | 8 upregulated targets |
38 | ZNF652 | 0.941784 | 1.000000 | 1 upregulated targets |
39 | TEAD4 | 0.942005 | 1.000000 | 9 upregulated targets |
40 | AR | 0.950859 | 1.000000 | 2 upregulated targets |
41 | TCF4 | 0.951373 | 1.000000 | 7 upregulated targets |
42 | HOXD13 | 0.958520 | 1.000000 | 2 upregulated targets |
43 | IGF1R | 0.962245 | 1.000000 | 1 upregulated targets |
44 | FUS | 0.963771 | 1.000000 | 8 upregulated targets |
45 | DMRT1 | 0.965024 | 1.000000 | 1 upregulated targets |
46 | ZNF274 | 0.965072 | 1.000000 | 4 upregulated targets |
47 | ELF1 | 0.965904 | 1.000000 | 1 upregulated targets |
48 | NR4A2 | 0.967209 | 1.000000 | 2 upregulated targets |
49 | STAT1 | 0.968255 | 1.000000 | 10 upregulated targets |
50 | PCGF2 | 0.971024 | 1.000000 | 6 upregulated targets |
Rank | Transcription Factor | P-value | FDR | Target |
---|---|---|---|---|
1 | SUZ12* | 8.104805e-07 | 0.000624 | 35 upregulated targets |
2 | FOSL1 | 2.913719e-01 | 1.000000 | 11 downregulated targets |
3 | GABPA | 3.811643e-01 | 1.000000 | 4 downregulated targets |
4 | EP300 | 4.452247e-01 | 1.000000 | 5 downregulated targets |
5 | STAT5A | 4.594903e-01 | 1.000000 | 6 downregulated targets |
6 | JUN | 4.997884e-01 | 1.000000 | 13 downregulated targets |
7 | ZEB1 | 5.076277e-01 | 1.000000 | 5 downregulated targets |
8 | NR2F2 | 5.082270e-01 | 1.000000 | 4 downregulated targets |
9 | MAFK | 5.354835e-01 | 1.000000 | 5 downregulated targets |
10 | TCF7L2 | 6.160310e-01 | 1.000000 | 13 downregulated targets |
11 | JUND | 6.239895e-01 | 1.000000 | 6 downregulated targets |
12 | REST | 6.363081e-01 | 1.000000 | 16 downregulated targets |
13 | TCF12 | 6.567392e-01 | 1.000000 | 20 downregulated targets |
14 | RXRA | 6.789487e-01 | 1.000000 | 3 downregulated targets |
15 | NR3C1 | 7.096907e-01 | 1.000000 | 5 downregulated targets |
16 | FOSL2 | 7.186941e-01 | 1.000000 | 7 downregulated targets |
17 | ESR1 | 7.263139e-01 | 1.000000 | 14 downregulated targets |
18 | CBX2 | 7.482475e-01 | 1.000000 | 46 downregulated targets |
19 | EZH2 | 7.482475e-01 | 1.000000 | 46 downregulated targets |
20 | CEBPB | 7.722236e-01 | 1.000000 | 6 downregulated targets |
21 | NANOG | 7.778610e-01 | 1.000000 | 2 downregulated targets |
22 | ESRRA | 7.860925e-01 | 1.000000 | 2 downregulated targets |
23 | BCL11A | 7.940548e-01 | 1.000000 | 2 downregulated targets |
24 | FOXM1 | 7.946520e-01 | 1.000000 | 3 downregulated targets |
25 | NELFE | 8.443775e-01 | 1.000000 | 6 downregulated targets |
26 | FOXA2 | 8.719878e-01 | 1.000000 | 6 downregulated targets |
27 | HSF1 | 8.725893e-01 | 1.000000 | 4 downregulated targets |
28 | ZKSCAN1 | 8.764772e-01 | 1.000000 | 4 downregulated targets |
29 | ZC3H11A | 8.810919e-01 | 1.000000 | 15 downregulated targets |
30 | TEAD4 | 8.922340e-01 | 1.000000 | 2 downregulated targets |
31 | CBX3 | 8.972620e-01 | 1.000000 | 3 downregulated targets |
32 | NFIC | 9.453749e-01 | 1.000000 | 11 downregulated targets |
33 | GATA3 | 9.456574e-01 | 1.000000 | 22 downregulated targets |
34 | CBX8 | 9.471758e-01 | 1.000000 | 40 downregulated targets |
35 | POLR3A | 9.548442e-01 | 1.000000 | 2 downregulated targets |
36 | HDAC2 | 9.556503e-01 | 1.000000 | 3 downregulated targets |
37 | RFX5 | 9.566407e-01 | 1.000000 | 7 downregulated targets |
38 | RAD21 | 9.573528e-01 | 1.000000 | 19 downregulated targets |
39 | MEF2A | 9.576790e-01 | 1.000000 | 4 downregulated targets |
40 | RELA | 9.587878e-01 | 1.000000 | 13 downregulated targets |
41 | HNF4G | 9.684542e-01 | 1.000000 | 9 downregulated targets |
42 | CTCF | 9.758291e-01 | 1.000000 | 21 downregulated targets |
43 | STAT3 | 9.779912e-01 | 1.000000 | 37 downregulated targets |
44 | STAT1 | 9.800504e-01 | 1.000000 | 15 downregulated targets |
45 | GATA2 | 9.822087e-01 | 1.000000 | 5 downregulated targets |
46 | FOS | 9.836004e-01 | 1.000000 | 13 downregulated targets |
47 | BATF | 9.842673e-01 | 1.000000 | 11 downregulated targets |
48 | BCL3 | 9.875384e-01 | 1.000000 | 6 downregulated targets |
49 | NFE2 | 9.943888e-01 | 1.000000 | 2 downregulated targets |
50 | ZZZ3 | 9.951004e-01 | 1.000000 | 2 downregulated targets |
Rank | Transcription Factor | P-value | FDR | Target |
---|---|---|---|---|
1 | GATA6* | 1.541629e-07 | 0.000207 | 25 downregulated targets |
2 | GRHL3* | 2.261381e-05 | 0.010131 | 21 downregulated targets |
3 | ZNF750* | 2.261381e-05 | 0.010131 | 21 downregulated targets |
4 | POU3F4* | 6.974144e-06 | 0.010182 | 22 upregulated targets |
5 | OVOL1* | 6.975492e-05 | 0.023438 | 20 downregulated targets |
6 | TP63* | 2.043108e-04 | 0.039228 | 19 downregulated targets |
7 | DSP* | 2.043108e-04 | 0.039228 | 19 downregulated targets |
8 | FOXN1* | 2.043108e-04 | 0.039228 | 19 downregulated targets |
9 | BMP2 | 5.671016e-04 | 0.084687 | 18 downregulated targets |
10 | IRF6 | 5.671016e-04 | 0.084687 | 18 downregulated targets |
11 | EN2 | 2.043108e-04 | 0.149147 | 19 upregulated targets |
12 | GSC | 1.488534e-03 | 0.181872 | 17 downregulated targets |
13 | SOX17 | 1.488534e-03 | 0.181872 | 17 downregulated targets |
14 | GRHL1 | 3.686336e-03 | 0.353888 | 16 downregulated targets |
15 | PLEK2 | 3.686336e-03 | 0.353888 | 16 downregulated targets |
16 | EHF | 3.686336e-03 | 0.353888 | 16 downregulated targets |
17 | TBX22 | 1.488534e-03 | 0.538205 | 17 upregulated targets |
18 | SOX1 | 1.488534e-03 | 0.538205 | 17 upregulated targets |
19 | SOX30 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
20 | ASCL1 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
21 | LHX1 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
22 | SP8 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
23 | POU3F2 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
24 | PAX3 | 3.686336e-03 | 0.538205 | 16 upregulated targets |
25 | LHX2 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
26 | NEUROG3 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
27 | HES5 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
28 | FOXA2 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
29 | POU4F1 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
30 | OTX1 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
31 | IRX2 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
32 | NHLH2 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
33 | DACH1 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
34 | NEUROD4 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
35 | NR2F1 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
36 | POU3F3 | 8.592411e-03 | 0.570224 | 15 upregulated targets |
37 | FOXD4L1 | 8.592411e-03 | 0.721763 | 15 downregulated targets |
38 | BCL6B | 8.592411e-03 | 0.721763 | 15 downregulated targets |
39 | ALX3 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
40 | ZNF479 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
41 | ZNF618 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
42 | OTP | 1.880198e-02 | 0.946582 | 14 upregulated targets |
43 | INSM1 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
44 | SOX21 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
45 | IRX1 | 1.880198e-02 | 0.946582 | 14 upregulated targets |
46 | OTOP3 | 1.880198e-02 | 0.999513 | 14 downregulated targets |
47 | HNF1B | 1.880198e-02 | 0.999513 | 14 downregulated targets |
48 | PPP1R13L | 1.880198e-02 | 0.999513 | 14 downregulated targets |
49 | EVX1 | 3.852079e-02 | 0.999513 | 13 upregulated targets |
50 | RHOXF1 | 3.852079e-02 | 0.999513 | 13 upregulated targets |
**Table 5 | Transcription Factor Enrichment Analysis Results.** The tables display the results of the Transcription Factor (TF) enrichment analysis generated using Enrichr. Every row represents a TF; significant TFs are highlighted in bold. A and B display results generated using ChEA and ENCODE libraries, indicating TFs whose experimentally validated targets are enriched. C displays results generated using the ARCHS4 library, indicating TFs whose top coexpressed genes (according to the ARCHS4 dataset) are enriched.
Raw RNA-seq data for GEO dataset GSE76023 was downloaded from the SRA database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76023) and quantified to gene-level counts using the ARCHS4 pipeline (Lachmann et al., 2017). Gene counts were downloaded from the ARCHS4 gene expression matrix v1.1. For more information about ARCHS4, as well as free access to the quantified gene expression matrix, visit the project home page at the following URL: http://amp.pharm.mssm.edu/archs4/download.html.
Raw counts were normalized to log10-Counts Per Million (logCPM) by dividing each column by the total sum of its counts, multiplying it by 10^6, and then applying a log10-transform.
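The normalization described above can be sketched in a few lines of NumPy. The function name `log_cpm` and the `pseudocount` argument are assumptions added for illustration: the text does not say how zero counts are handled, so a pseudocount of 1 is used here to keep the logarithm defined.

```python
import numpy as np


def log_cpm(counts, pseudocount=1.0):
    """Convert a genes x samples matrix of raw counts to log10 counts per million."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)      # total counts per sample (column)
    cpm = counts / library_sizes * 1e6      # rescale each column to one million
    return np.log10(cpm + pseudocount)      # assumed pseudocount avoids log10(0)


# Toy input: three genes and two samples taken from Table 1.
# Library sizes here are sums over three genes only, not the real totals.
example = np.array([[57, 625], [1, 48], [5, 122]])
normalized = log_cpm(example)
```

Because every column is rescaled to the same total before the log-transform, the resulting values are comparable across samples with different sequencing depths.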
The gene expression signature was generated by comparing gene expression levels between the control group and the experimental group using the limma R package (Ritchie et al., Nucleic Acids Research 2015), available on Bioconductor: http://bioconductor.org/packages/release/bioc/html/limma.html.
Principal Component Analysis was performed using the PCA function from the sklearn Python module. Prior to performing PCA, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with most variable expression, and finally transformed using the Z-score method.
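The preprocessing and projection steps can be sketched as follows. To keep the example dependency-light, the principal components are computed with a plain NumPy singular value decomposition rather than the sklearn PCA function named above; the helper names and the random toy matrix are hypothetical, and the input is assumed to be already logCPM-normalized.

```python
import numpy as np


def top_variable_genes(expr, n_genes):
    """Return row indices of the n_genes rows with the highest variance."""
    variances = expr.var(axis=1)
    return np.argsort(variances)[::-1][:n_genes]


def zscore_rows(expr):
    """Center each gene (row) to mean 0 and scale it to unit standard deviation."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    return (expr - mean) / std


def pca_samples(expr, n_components=3):
    """Project samples (columns of expr) onto the leading principal components."""
    samples = expr.T                                    # samples x genes
    centered = samples - samples.mean(axis=0)           # center each gene across samples
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]     # sample coordinates on the PCs
    explained = (s ** 2) / np.sum(s ** 2)               # fraction of variance per PC
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(50, 6))   # hypothetical logCPM matrix: 50 genes x 6 samples
keep = top_variable_genes(toy_expr, n_genes=20)
scores, explained = pca_samples(zscore_rows(toy_expr[keep]), n_components=3)
```

Each row of `scores` gives the coordinates of one sample in the space of the first three components, which is what a three-dimensional PCA scatter plot displays.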
The interactive heatmap was generated using Clustergrammer (Fernandez et al., 2017) which is freely available at http://amp.pharm.mssm.edu/clustergrammer/. Prior to displaying the heatmap, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with most variable expression, and finally transformed using the Z-score method.
Read counts were calculated by summing each column of the raw gene count matrix. Total counts were subsequently divided by 10^6 and displayed as million reads.
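A minimal sketch of this calculation, using the five genes shown in Table 1 for three of the samples. The totals below are therefore sums over five genes only and do not reflect the real library sizes; the function name is hypothetical.

```python
import numpy as np

sample_ids = ["GSM1972961", "GSM1972956", "GSM1972957"]

# Rows are genes, columns are samples (values copied from Table 1)
counts = np.array([
    [57, 625, 564],    # A1BG
    [1, 48, 36],       # A1CF
    [5, 122, 78],      # A2M
    [223, 952, 355],   # A2ML1
    [3, 31, 23],       # A2MP1
])


def library_sizes_in_millions(count_matrix):
    """Sum the counts of each sample (column) and express the total in millions."""
    return count_matrix.sum(axis=0) / 1e6


sizes = dict(zip(sample_ids, library_sizes_in_millions(counts)))
```

Samples whose totals differ strongly from the rest are candidates for closer inspection before downstream analysis.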
The gene expression signature was generated by performing differential gene expression analysis using the methods described in the Differential Gene Expression section.
Gene fold changes were transformed using log2 and displayed on the x axis; P-values were corrected using the Benjamini-Hochberg method, transformed using –log10, and displayed on the y axis. See the Differential Gene Expression section for more information on the methods used to generate these values.
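The two transformations applied to the P-values can be sketched in plain Python. The implementation below is a standard formulation of the Benjamini-Hochberg adjustment written for illustration; the function names and toy values are hypothetical, and the sketch assumes no adjusted P-value is exactly zero.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])   # indices by ascending P-value
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so the adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_coordinates(log2_fold_changes, pvalues):
    """Pair each log2-fold change (x) with -log10 of its adjusted P-value (y)."""
    adjusted = benjamini_hochberg(pvalues)
    return [(lfc, -math.log10(q)) for lfc, q in zip(log2_fold_changes, adjusted)]


toy_logfc = [-1.4, 1.6, -0.2, 0.1]
toy_pvalues = [0.01, 0.04, 0.03, 0.5]
points = volcano_coordinates(toy_logfc, toy_pvalues)
```

Genes with large absolute fold changes and small adjusted P-values end up in the upper left and upper right corners of the plot.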
Average gene expression was identified by calculating the mean of the normalized gene expression values and displayed on the x axis; P-values were corrected using the Benjamini-Hochberg method, transformed using –log10, and displayed on the y axis. For more information on the methods used to generate the signature, see the Differential Gene Expression section.
The up-regulated and down-regulated gene sets were generated by extracting the 500 genes with the respectively highest and lowest values from the gene expression signature. The gene sets were subsequently submitted to Enrichr (Kuleshov et al., 2016), which is freely available at http://amp.pharm.mssm.edu/Enrichr/, using the gene set upload API. For more information on the methods used to generate the signature, see the Differential Gene Expression section.
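The gene set extraction step can be sketched as a simple ranking. The text does not say which column of the signature is ranked, so the example assumes the logFC values; the function name `split_signature` is hypothetical, and the submission to the Enrichr API is not shown.

```python
def split_signature(signature, geneset_size=500):
    """Return (up, down) gene lists taken from the two ends of the ranking.

    `signature` maps gene symbols to a signed differential expression value.
    """
    ranked = sorted(signature, key=signature.get, reverse=True)   # highest value first
    up = ranked[:geneset_size]
    down = ranked[::-1][:geneset_size]                            # lowest value first
    return up, down


# logFC values copied from Table 3
toy_signature = {
    "BNIP3": -1.38,
    "SYT11": 1.60,
    "ZNF680": -4.17,
    "CHMP1B2P": 2.19,
    "SPATA31A5": 3.96,
    "TEK": -1.45,
}
up_genes, down_genes = split_signature(toy_signature, geneset_size=2)
```

With a real signature of tens of thousands of genes, `geneset_size=500` selects two disjoint sets; on very small inputs the two lists can overlap.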
Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: GO_Biological_Process_2017b, GO_Molecular_Function_2017b, GO_Cellular_Component_2017b. Significant terms are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section.
Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: KEGG_2016, Reactome_2016, WikiPathways_2016. Significant terms are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section.
Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: ChEA_2016, ENCODE_TF_ChIP-seq_2015, ARCHS4_TFs_Coexp. Significant results are determined by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr section.
Fernandez, N.F., Gundersen, G.W., Rahman, A., Grimes, M.L., Rikova, K., Hornbeck, P., and Ma'ayan, A. (2017). Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data. Scientific Data 4, 170151. doi: http://dx.doi.org/10.1038/sdata.2017.151
Kuleshov, M.V., Jones, M.R., Rouillard, A.D., Fernandez, N.F., Duan, Q., Wang, Z., Koplev, S., Jenkins, S.L., Jagodnik, K.M., Lachmann, A., et al. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research 44, W90–W97. doi: https://dx.doi.org/10.1093/nar/gkw377
Lachmann, A., Torre, D., Keenan, A.B., Jagodnik, K.M., Lee, H.J., Silverstein, M.C., Wang, L., and Ma’ayan, A. (2017). Massive Mining of Publicly Available RNA-seq Data from Human and Mouse (Cold Spring Harbor Laboratory). doi: https://doi.org/10.1101/189092
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572. doi: https://doi.org/10.1080/14786440109462720
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47. doi: https://doi.org/10.1093/nar/gkv007